knitr::opts_chunk$set(echo = TRUE)

A Data Science Workflow

The objective of this workflow example is to combine a series of Data Science skills. We will be going through all the stages of a typical Social Science project from scraping data, to constructing databases, to visualizing and interpreting that data.

For simplicitiy, I divide this workflow into sections, where I demonstrate a key skill with a specifc purpose. Of course, in the real world such steps are interchangeable and can be done many different ways!

For this project, we are going to gather information about research universities in the US, which are ranked as R1 (Very High Research Activity) and R2 (High Research Activity). We will gather information about these from different sources, save it into tables and upload them to a relational database. From our database, we can then use SQL commands to conduct fast, simple analysis.

Without further adue, let us run the command to create a database. Note that if you’re working in R markdown, this will create the database in the directory where your .Rmd file is saved.

# Here, we'll need the DBI, and the RSQLite package
library(RSQLite)
library(DBI)

# creating the database with 
demo_db <- DBI::dbConnect(RSQLite::SQLite(), "demo_db.db")

# checking that the database created exists 
file.exists("demo_db.db")
## [1] TRUE

The above code snippet should return “TRUE” when you run it on your machine. It will have created the database, which is now empty, on your hard-drive. We’ll be scraping/downloading the data now, and adding it to our database using a primary key as we go.

Scraping R1 & R2 Universities’ Wikipedia page

Here, I’ll be writing an automatic function that will scrape this page. Note that as always, when scraping pages like this, we’re always susceptible to changes in the layout, content of the web-page. This is particularly the case for Wikipedia, where users can freely edit at any time and revert any changes. For this reason, we aim to scrape once and save our data immediately!

We’ll be collecting:

Upon close inspection, the Wikipedia page is made up of 3 large tables, R1, R2, R3 level univerisies. But we only care about R1 and R2 institutions here. I make a function that takes no argument because I’ll only be able to use it for this wikipedia page, due to its unique layout, but of course you could make one with generic arguments that you can keep in your repository and re-use!

Here’s the scraper:

# here we'll need the tidyverse package to make a tibble
# and the xml2 package to process the selectors of the different parts of the website
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(xml2)
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
uni_scraper <- function() {
  
  # Specify the URL of the Wikipedia page
  uni_url <- "https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States"
  # Read the HTML content from the specified URL
  page <- read_html(uni_url)
  
  # Extract all tables containing information about research universities
  universities_tables <- html_elements(page, css = "table.wikitable")
  
  # Initialize empty vectors to store the data
  names <- c()
  status <- c()
  city <- c()
  state <- c()
  urls <- c()
  
  # Loop through each table (but limit to tables 1-2 since the others are not R1 or R2 research centers)
  for (table_index in seq_along(universities_tables[1:2])) {
    
    universities_table <- universities_tables[[table_index]]
    
    # Extract the rows from the table
    rows <- html_elements(universities_table, "tr")
    
    # Loop through each row starting from the second row (skipping header)
    for (i in 2:length(rows)) {
      row <- rows[i]
      
      # Extract columns from the row
      columns <- html_elements(row, "td")
      
      # Extract data and append to the respective vectors
      names <- c(names, html_text(columns[1]))
      status <- c(status, html_text(columns[2]))
      city <- c(city, html_text(columns[3]))
      state <- c(state, html_text(columns[4]))
      
      # Extract the URL from the first column
      url_element <- html_element(row, "a") %>% html_attr("href")
      full_url <- ifelse(is.na(url_element), NA, paste("https://en.wikipedia.org",  url_element, sep = ""))
      urls <- c(urls, full_url)
    }
  }
  
  # Create a data frame with the collected data
  universities_data <- tibble(
    Name = names,
    Status = status,
    City = city,
    State = state,
    URL = urls)
  
  return(universities_data)
}

Now that we made the function, we’ll need to call it, so that it actually executes. Note that it is good practice to keep this function separate from its call in a Rmarkdown file so that when you knit, you avoid re-scraping all the data.

# Call the scraping function
R1R2_table <- uni_scraper()
head(R1R2_table)
## # A tibble: 6 × 5
##   Name                     Status               City          State  URL        
##   <chr>                    <chr>                <chr>         <chr>  <chr>      
## 1 Arizona State University Public               Tempe         "AZ\n" https://en…
## 2 Auburn University        Public               Auburn        "AL\n" https://en…
## 3 Baylor University        Private (non-profit) Waco          "TX\n" https://en…
## 4 Binghamton University    Public               Vestal        "NY\n" https://en…
## 5 Boston College           Private (non-profit) Chestnut Hill "MA\n" https://en…
## 6 Boston University        Private (non-profit) Boston        "MA\n" https://en…

And that’s it! We’ve saved all this information in a table.

Now let’s navigate to individual university pages, following the hyperlinks in the tables, and gather:

Of course, we could have done this in the same function, but for modularity purposes I do it in two functions. (And as you can see, this second one is already long enough on its own!)

scrape_university_extra <- function() {
  # Read the HTML content from the specified URL
  url <- "https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States"
  page <- read_html(url)
  
  # Extract all tables containing information about universities
  universities_tables <- html_elements(page, css = "table.wikitable")
  
  # Initialize empty vectors to store the data
  names <- c()
  total_students <- c()
  endowment <- c()
  coordinates <- c()
  
  # Loop through the first two tables
  for (table_index in seq_along(universities_tables[1:2])) {
    universities_table <- universities_tables[[table_index]]
    
    # Extract the rows from the table
    rows <- html_elements(universities_table, "tr")
    
    # Loop through each row starting from the second row (skipping header)
    for (i in 2:length(rows)) {
      row <- rows[i]
      
      # Extract columns from the row
      columns <- html_elements(row, "td")
      
      # Extract university name and URL
      name <- html_text(columns[1])
      url_element <- html_element(columns[1], "a")
      university_url <- if (!is.null(url_element)) url_element %>% html_attr("href") else NA
      full_url <- ifelse(is.na(university_url), NA, paste("https://en.wikipedia.org", university_url, sep = ""))
      
      # Navigate to the university's dedicated page
      university_page <- read_html(full_url)
      
      #scrape the coordinates (lat and long. simultaneously)
      coordinates_value <- university_page %>% 
              html_element(css =".geo-dms") %>% 
              html_text()
      
      #scrape the endowment value but keep only the USD value
      endowment_value <- university_page %>% 
              html_element(css = "table.infobox th:contains('Endowment') + td") %>%
              html_text() %>%
              gsub("\\s*[\\(\\[].*", "", .)
      
      #scrape total number of students, but remove unneccesary information (different campuses etc.)
      total_students_value <- university_page %>% 
        html_element(css= "table.infobox th:contains('Students') + td") %>% 
        html_text() %>%
        str_replace_all("\\[[^\\]]*\\]|\\([^\\)]*\\)|[a-zA-Z]", " ")
      
      total_students_value <- gsub(",", "", total_students_value)
      
      if(!is.na(total_students_value) && any(str_detect(total_students_value, "\\s+"))){
        individual_numbers <- as.numeric(strsplit(total_students_value, "\\s+")[[1]])
        total_students_value <- sum(individual_numbers)
      }else{
        total_students_value <- as.numeric(total_students_value)
      }
      
      #Append the data to the vectors
      names <- c(names, name)
      coordinates <- c(coordinates, coordinates_value)
      endowment <- c(endowment, endowment_value)
      total_students <- c(total_students, total_students_value)
    }
  }
  
  # Create a tibble with the collected data using tidyverse
  universities_data <- tibble(
    Name = names,
    Coordinates = coordinates,
    Endowment = endowment,
    TotalStudents = total_students)
  
  return(universities_data)
}

Now let’s call our function (this might take a while given all the pages we have to go to), and again save the results into a table. That way, we can merge them later.

# Call the scraping function
university_info_extra <- scrape_university_extra()

# I use head again here to avoid repeating the entire table
head(university_info_extra)
## # A tibble: 6 × 4
##   Name                     Coordinates            Endowment       TotalStudents
##   <chr>                    <chr>                  <chr>                   <dbl>
## 1 Arizona State University 33°25′15″N 111°56′02″W $1.47 billion          142029
## 2 Auburn University        32°36′11″N 85°29′10″W  $1.05 billion           33015
## 3 Baylor University        31°32′53″N 97°06′58″W  $1.97 billion           20626
## 4 Binghamton University    42°05′20″N 75°58′01″W  $ 148.1 million         18148
## 5 Boston College           42°20′06″N 71°10′13″W  $3.3 billion            15106
## 6 Boston University        42°20′56″N 71°06′01″W  $3.2 billion            36729

You might get, like I did, a message saying NAs introduced by coercion. In this case, that’s okay. This is because Wikipedia pages aren’t identical for all the universities, and some information might be in different places. In a Data Science project, you might want to manually identify these and collect the data you need. But in this case, we’ll just be leaving the NAs in our final table.

Now that we have our two tables, we can merge them with a left-join command.

# merge the two table outputs from the two functions into one
merged_uni_table <- left_join(university_info_extra, R1R2_table, by = "Name")
head(merged_uni_table)
## # A tibble: 6 × 8
##   Name                     Coordinates  Endow…¹ Total…² Status City  State URL  
##   <chr>                    <chr>        <chr>     <dbl> <chr>  <chr> <chr> <chr>
## 1 Arizona State University 33°25′15″N … $1.47 …  142029 Public Tempe "AZ\… http…
## 2 Auburn University        32°36′11″N … $1.05 …   33015 Public Aubu… "AL\… http…
## 3 Baylor University        31°32′53″N … $1.97 …   20626 Priva… Waco  "TX\… http…
## 4 Binghamton University    42°05′20″N … $ 148.…   18148 Public Vest… "NY\… http…
## 5 Boston College           42°20′06″N … $3.3 b…   15106 Priva… Ches… "MA\… http…
## 6 Boston University        42°20′56″N … $3.2 b…   36729 Priva… Bost… "MA\… http…
## # … with abbreviated variable names ¹​Endowment, ²​TotalStudents

Now things are going to get a little more complicated. In this repository, you’ll find a file called ivyleague.csv. It contains information about ivy league universities. As you will have noticed, ivy leagues are included in our larger table, but there’s some information in our csv file that wikipedia cannot give us.

We will have to combine it in some way to have in our main table:

Let’s first read the data, and then use a regular expression to match the partial names in the csv file with those of the wider table.

ivys <- read.csv("ivyleague.csv")
# if a university matches the name partially in the large table and its private, then allocate 
# "yes" to ivy status in a new column of the large table
# I manually exclude Teachers College at Columbia University for simplicity  
merged_uni_table$IsIvy <- ifelse(
  grepl(paste0(".*?(", paste(ivys$uni_name, collapse = "|"), ").*"), 
        merged_uni_table$Name, ignore.case = TRUE) &
    !grepl("Teachers College at Columbia University", merged_uni_table$Name, ignore.case = TRUE) &
    merged_uni_table$Status == "Private (non-profit)",
  "Yes",
  "No")
#now adding the full names to the ivys table so that they can be merged and add information 
#about the counties and the ein
ivys$Name <- sapply(ivys$uni_name, function(uni) {
  matching_names <- merged_uni_table$Name[grepl(uni, merged_uni_table$Name, ignore.case = TRUE) & 
                                            merged_uni_table$Status == "Private (non-profit)"]
  
  if (length(matching_names) > 0) {
    return(matching_names[1])
  } else {
    return(NA)
  }
})

Now that we’ve done some data manipulation, we can merge the two tables and clean up the variable names. This way, when we upload it into our relational database, everything will match nicely. Make sure to concatenate the county and the state–it will make life easier later!

merged_uni_table <- left_join(merged_uni_table, ivys[, c("Name", "ein", "county")], by = c("Name" = "Name"))

#rename columns 
merged_uni_table <- merged_uni_table %>%
  rename(EIN_ivys = ein, County_ivys = county)

# Drop unnecessary columns from ivys data frame
ivys <- ivys[, !(names(ivys) %in% c("ein", "county"))]

#concatenating the county and state for all variables in merged_uni_table, where county won't show if info not available
#this will allow the table to be easily updated if county information is provided about any other university 
merged_uni_table <- merged_uni_table %>%
  mutate(county_state = ifelse(
    is.na(County_ivys),
    State,
    paste(County_ivys, State, sep = ", ") 
  ))%>%
  select(-County_ivys)

We have a nice table with all the information about ivy-league universities and R1, R2 universities. Let’s write it to the database!

#writing table to database
dbWriteTable(demo_db, "R1R2_uni_list", merged_uni_table)

Scraping World Rankings

We’ll now be looking at a different aspect of universities that was not available on the Wikipedia page (hopefully you can start to see how this is an issue you’d encounter daily in a Data Science job!). We’ll be collecting the university rankings for all the ivy league institutions. We’ll use the ARWU page for this.

We’ll make a function to collect the ranking for the university for the years 2003, 2013, and 2023. If you have a quick browse on the website, you’ll notice that some rankings are provided as a range; eg. 77-100. In these cases, we’ll need to take the midpoint and record that as the ranking.

# here we'll need the RSelenium package
library(RSelenium)

scrapeIvyRankings <- function() {
  list_ivy <- c("Harvard University", "Princeton University", "Yale University", 
                "Columbia University", "University of Pennsylvania", "Brown University", 
                "Dartmouth College", "Cornell University")

  Ivy_rankings <- data.frame(
    University = character(),
    Ranking = character()
  )
  rank_url <- "https://www.shanghairanking.com/"
  # Start the Selenium server:
  rD <- rsDriver(browser=c("firefox"), verbose = F, port = netstat::free_port(random = TRUE), chromever = NULL) 
  driver <- rD[["client"]] # note this alternative but equivalent call for setting the driver client

  # Navigate to the selected URL address
  driver$navigate(rank_url)
  
  # This step only need to happen once, as we can then easily just navigate to the date 

  ranking_page <- driver$findElement(using = "xpath", value = '//*[@id="arwu"]/div[1]/button')
  ranking_page$clickElement()
  
  for (i in list_ivy) { #for all the ivys
      for (year in c(2003, 2013, 2023)) { #for all the years
        # select the scrollable element of year at the top of the page
        date_selector <- driver$findElement(using = 'class name', value = 'inputWrapper')
        Sys.sleep(1)
        date_selector$clickElement()
        Sys.sleep(1)
        
        # select year of interest in sequence, using modular xpath (2024-year of interest)
        date_element_xpath <- paste0('//*[@id="bar-content"]/div[1]/div/div[2]/ul/li[', (2024 - year), ']')
        date_element <- driver$findElement(using = "xpath", value = date_element_xpath)
        
        date_element$clickElement()
        Sys.sleep(1)
        
        # Select the search bar for universities, clear it and type each university one by one
        search_bar <- driver$findElement(using = "class", value = "search-input")
        search_bar$clearElement()
        search_bar$sendKeysToElement(list(i))
        Sys.sleep(1)
        search_bar$sendKeysToElement(list(key = "enter"))
        Sys.sleep(2)
  
        # Scrape the ranking
        select_ranking <- driver$findElement(using = 'xpath', value = '//*[@id="content-box"]/div[2]/table/tbody/tr/td[1]/div')
        
        ranking <- as.character(select_ranking$getElementText())
  
        # Store the result in Ivy_rankings
        Ivy_rankings <- do.call(rbind, list(Ivy_rankings, 
                                            data.frame(University = i, 
                                                       Year = year, 
                                                       Ranking = ranking)))    
        }
  }
  # closing the driver
  driver$close()
  rD$server$stop()
  return(Ivy_rankings)
}

Now let’s call the function, do some cleaning of the rankings and write it to our database!

# Call the function to get Ivy League rankings
Ivy_rankings <- scrapeIvyRankings()

# clean the results before outputting the final table
# to turn the rankings to numbers and find midpoint for range measures
# I round the ranking to 0 decimal places, since half points in ranking are not substantively significant

ARWU_ivy_ranking <- Ivy_rankings %>%
  mutate(
    Ranking = ifelse(grepl("-", Ranking), gsub("-", " ", Ranking), Ranking),
    Ranking = sapply(strsplit(Ranking, " "), function(x) if(length(x) > 1) sum(as.numeric(x)) / 2 else as.numeric(x)),
    Ranking = round(Ranking,0)
  )
# writing table to database
dbWriteTable(demo_db, "ARWU_ivy_ranking", ARWU_ivy_ranking)

Now that we’ve gathered the universitys overall ranking, let’s gather the subject-specific rankings. Since this is a social science project, we’ll gather the social science rankings for the year 2023 of each ivy. As a difficulty in this task, we have to deal with the fact that not all universities offer all social science subjects. For this reason, we’ll scrape the whole table for social sciences, and the missing subjects will just be excluded from the table.

social_science_rank <- function() {
  # Input your list of universities here
  list_ivy <- c("Harvard University", "Princeton University", "Yale University", 
                "Columbia University", "University of Pennsylvania", "Brown University", 
                "Dartmouth College", "Cornell University")

  # Initialize the result table
  result_table <- data.frame()

  # Start the RSelenium driver
  rD <- rsDriver(browser = c("firefox"), verbose = FALSE, port = netstat::free_port(random = TRUE), chromever = NULL)
  driver <- rD[["client"]]

  # Navigate to the selected URL address
  driver$navigate("https://www.shanghairanking.com/")

  for (i in list_ivy) {
    # Navigate to the university page
    university_page <- driver$findElement(using = "xpath", value = '//*[@id="__layout"]/div/div[1]/div[1]/div/div[2]/ul/li[3]/a')
    university_page$clickElement()
    Sys.sleep(0.5)

    # Search for the search bar, clear it and type the university name
    large_search_bar <- driver$findElement(using = "class", value = 'input')
    large_search_bar$clickElement()
    Sys.sleep(1)
    
    large_search_bar$clearElement()
    large_search_bar$sendKeysToElement(list(i))
    Sys.sleep(1)
    
    large_search_bar$sendKeysToElement(list(key = "enter"))
    Sys.sleep(1)

    # Navigate to the social sciences page
    select_uni_page <- driver$findElement(using = "class", value = 'univ-main')
    select_uni_page$clickElement()
    Sys.sleep(1)

    social_science_select <- driver$findElement(using = "class", value = "inputWrapper")
    social_science_select$clickElement()
    Sys.sleep(0.5)

    social_science_click <- driver$findElement(using = "xpath", value = '//*[@id="gras"]/div[2]/div[1]/div[1]/div[2]/div/div[2]/ul/li[last()]')
    social_science_click$clickElement()
    Sys.sleep(0.5)

    # Extract rankings table
    social_sciences_rankings <- driver$findElement(using = 'class name', value = "table-container")
    rankings_html <- read_html(social_sciences_rankings$getElementAttribute('innerHTML')[[1]])

    # Transform the HTML table into a data frame
    rankings_table <- html_table(rankings_html)[[1]]

    # Add University column
    rankings_table$University <- i
 
    # Append to the result table
    result_table <- rbind(result_table, rankings_table)
  }

  # Return the result table and close driver
  driver$close()
  rD$server$stop()
  return(result_table)
}

Now we can call the function, as usual, and save it to our database after some cleaning. Again, here we have some ranks that are ranges, so we adjust these.

# Call to function 
social_rank_table <- social_science_rank()

# If there's a range provided I take the mean ranking
# I round to 0 decimal places because "12.5" ranking is not substantively meaningful compared to 13 or 12
# Especially since the main interest of ranking is to compare universities with each other 

clean_social_science <- social_rank_table %>%
  mutate(
    Rank = ifelse(grepl("-", Rank), gsub("-", " ", Rank), Rank),
    Rank = sapply(strsplit(Rank, " "), function(x) if(length(x) > 1) sum(as.numeric(x)) / 2 else as.numeric(x)),
    Rank = round(Rank, 0)
  )
dbWriteTable(demo_db, "Social_Science_Ivy_Ranking", clean_social_science)

Gathering API information

We’ve scraped websites and written web-crawlers. Now we use APIs to gather data (where possible, it’s always preferable to use APIs).

We’ll first work with the ProPublicaAPI where we’ll use the Organization Method to obtain for each ivy league university the:

For the years 2011-2021. When reading the API documentation, it becomes clear we have to use the EIN of the ivy universities for recovering the data. Below is the function which accesses the API.

# We'll need the httr package
library(httr)

get_ivy_info <- function() {
  # Create an empty list to store individual data frames
  results_list <- list()

  # Assuming merged_uni_table is your dataset
  # Replace "IsIvy" with the actual column name in your dataset
  eins_list <- merged_uni_table %>%
    filter(IsIvy == "Yes") %>%
    pull("EIN_ivys")  # Replace "EIN_ivys" with the actual column name in your dataset

  # Loop through Ivy League universities and fetch information
  for (ein in eins_list) {
    api_url <- paste0('https://projects.propublica.org/nonprofits/api/v2/organizations/', ein, '.json')

    # Make the GET request
    response <- GET(api_url)

    # Check if the request was successful (status code 200)
    if (status_code(response) != 200) {
      cat("Error:", status_code(response), "\n")
      cat("--------------------------------------------------\n")
      next()  # Skip to the next iteration if there's an error
    }

    data <- content(response, "parsed")

    # Extract relevant data
    ein_data <- data.frame(
      ein = data$organization$ein,
      year = sapply(data$filings_with_data, function(x) x$tax_prd_yr),
      revenue = sapply(data$filings_with_data, function(x) x$totrevenue),
      assets = sapply(data$filings_with_data, function(x) x$totassetsend),
      stringsAsFactors = FALSE
    )

    results_list <- bind_rows(results_list, ein_data)
  }

  return(results_list)
}
ivy_res <- get_ivy_info()

We’ve now used the method outlined in the API documentation to gather our information and formatted it into a table inside the function. That leaves running the function and cleaning the output table (and write it to the database of course).

# Cleaning the API output to be just the columns of interest 
ivy_res_cleaned <- ivy_res %>%
  mutate(ein = as.character(ein)) %>% #make sure variables are the same type to make merging easier 
  left_join(
    merged_uni_table %>%
      mutate(EIN_ivys = as.character(EIN_ivys)) %>%
      select(EIN_ivys, Name),
    by = c("ein" = "EIN_ivys")
  ) %>%
  select(Name, ein, year, revenue, assets)


# writing table to database
dbWriteTable(demo_db, "ivy_fiscal_info", ivy_res_cleaned)

Another common way to interact with APIs is packaged APIs. This is the case for the tidycensus package in R, which conveniently allows us to interact with US census data. The documentation is available here. Note that to interact with it, you’ll need to create an account and save your key as you would with any other API in an .env file.

I follow the documentation closely to retrieve the names of all the Counties in the US and their estimated median household income for every county for both 2015 and 2020 (based on the American Community Survey (ACS)).

library(tidycensus)
# Read the key from my envs folder
readRenviron("tidycensusapi.env")
census_api_key <- Sys.getenv("tidycensus_key")

# choosing the county geography and the year 2015
# variable B19013_001 for the median household income adjusted for inflation 
housing_income2015 <- get_acs(geography = "county", 
              variables = c(medincome = "B19013_001"), 
              year = 2015)
housing_income2020 <- get_acs(geography = "county", 
              variables = c(medincome = "B19013_001"), 
              year = 2020)

# I take ivy universities and merge them with the 2015 housing income table. For this, I have to match the first part of the county
housing_info2015 <- merged_uni_table %>%
  filter(IsIvy == "Yes") %>%
  mutate(county_state_match = str_extract(county_state, "^[^,]+, [A-Z]")) %>%
  left_join(housing_income2015 %>%
              select(NAME, estimate) %>%
              mutate(county_state_match = str_extract(NAME, "^[^,]+, [A-Z]")),
            by = "county_state_match") %>%
  distinct(county_state_match, .keep_all = TRUE) %>%
  select(Name, county_state, estimate) %>%
  rename("2015" = estimate) %>%
  mutate(county_state = str_replace(county_state, "\n", ""))


# Repeat the process but adding 2020 as well
# Then we can pivot it longer
housing_info2020 <- housing_info2015 %>%
  mutate(county_state_match = str_extract(county_state, "^[^,]+, [A-Z]")) %>%
  left_join(housing_income2020 %>%
              select(NAME, estimate) %>%
              mutate(county_state_match = str_extract(NAME, "^[^,]+, [A-Z]")), 
            by = "county_state_match") %>%
  distinct(county_state_match, .keep_all = TRUE) %>%
  rename("2020" = estimate) %>%
  select(Name, county_state, "2015", "2020")

Now that we’ve gathered the data from the API, and dealt with the differing county names, we can pivot it longer to have tidy-long data. We’ll also write it to the database.

housing_info_long <- housing_info2020 %>%
  pivot_longer(cols = c("2015", "2020"), names_to = "year", values_to = "estimate")

# Write in db
dbWriteTable(demo_db, "ivy_county_income", housing_info_long)

Interacting with the Database

We’ve gathered a lot of information about universities and ivys! Now is the time to make full use of our database and retrieve information combinations that are interesting.

For this doing, we’ll use SQL and its integrated R version (RSQLite) to retrieve information across tables into a single table. We’ll save that table and use it for our analysis, this way, we don’t have to filter through information.

In the SQL query below, I join all the tables we made thus far into one and save it. This requires a series of JOIN commands. I retrieve these fields:

# Connecting to the database 
db <- dbConnect(RSQLite::SQLite(), "demo_db.db")

insights_table <- dbGetQuery(db, "WITH EndowmentCTE AS (
    SELECT
        R1R2_uni_list.Name AS University,
        (CASE
            WHEN R1R2_uni_list.Endowment LIKE '%$%' AND R1R2_uni_list.Endowment LIKE '%billion%' THEN
                CAST(REPLACE(REPLACE(R1R2_uni_list.Endowment, '$', ''), ' billion', '') AS DECIMAL(20, 2)) * 1000000000
            WHEN R1R2_uni_list.Endowment LIKE '%$%' AND R1R2_uni_list.Endowment LIKE '%million%' THEN
                CAST(REPLACE(REPLACE(R1R2_uni_list.Endowment, '$', ''), ' million', '') AS DECIMAL(20, 2)) * 1000000
            ELSE
                CAST(REPLACE(R1R2_uni_list.Endowment, '$', '') AS DECIMAL(20, 2))
        END / R1R2_uni_list.TotalStudents) AS EndowmentPerStudent,
        a.AverageOverallRanking,
        s.SocialScienceRank
    FROM R1R2_uni_list
    LEFT JOIN ARWU_Ivy_ranking ivy ON R1R2_uni_list.Name = ivy.University
    LEFT JOIN (
        SELECT
            s1.University,
            ROUND(AVG(r.Ranking), 2) AS AverageOverallRanking
        FROM Social_Science_Ivy_Ranking s1
        LEFT JOIN ARWU_ivy_ranking r ON s1.University = r.University
        GROUP BY s1.University
    ) a ON R1R2_uni_list.Name = a.University
    LEFT JOIN (
        SELECT
            s2.University,
            ROUND(AVG(s2.Rank), 2) AS SocialScienceRank
        FROM Social_Science_Ivy_Ranking s2
        WHERE s2.Subject IN ('Economics', 'Political Sciences', 'Sociology')
        GROUP BY s2.University
    ) s ON R1R2_uni_list.Name = s.University
    WHERE ivy.University IS NOT NULL
    GROUP BY s.University
),

CountyIncomeCTE AS (
    SELECT
        Name AS University,
        county_state,
        ROUND(AVG(estimate), 2) AS AverageCountyIncome
    FROM ivy_county_income
    GROUP BY Name, county_state
),

RevenuePerStudentCTE AS (
    SELECT
        i.Name,
        ROUND(AVG(i.revenue / COALESCE(r.TotalStudents, 1)), 2) AS AverageRevenuePerStudent
    FROM ivy_fiscal_info i
    LEFT JOIN R1R2_uni_list r ON i.Name = r.Name
    GROUP BY i.Name
)

SELECT
    e.University,
    c.county_state AS County,
    e.EndowmentPerStudent,
    e.AverageOverallRanking,
    e.SocialScienceRank,
    c.AverageCountyIncome,
    r.AverageRevenuePerStudent
FROM EndowmentCTE e
LEFT JOIN CountyIncomeCTE c ON e.University = c.University
LEFT JOIN RevenuePerStudentCTE r ON e.University = r.Name;
")

Good news, we’ve finished with our data collection and cleaning. We can now do some fun plotting with ggplot! We’ll plot the following:

ggplot is by far the easiest way to do this:

library(ggplot2)
# Plot 1: Average university ranking vs. average Econ/PS/Soc ranking
ggplot(insights_table, aes(x = AverageOverallRanking, y = SocialScienceRank)) +
  geom_point(size = 4, color = "hotpink") +
  labs(title = "Figure 1: Average University Ranking vs. Average Econ/PS/Soc Ranking",
       x = "Average University Ranking",
       y = "Average Econ/PS/Soc Ranking") +
  geom_smooth(method='lm', col = "blue", linetype = "dotted", fill = "lightpink", alpha = 0.2) + #adding regression line 
  annotate("text", x = 200, y = 65, label = "Regression Line", color = "black", size = 3) +  #labelling regression line 
  theme_minimal()

# Plot 2: Average university ranking vs. endowment per student
ggplot(insights_table, aes(x =AverageOverallRanking , y = EndowmentPerStudent)) +
  geom_point(size = 5, color = "hotpink") +
  labs(title = "Figure 2: Average University Ranking vs. Endowment Per Student (USD) ",
       x = "Average University Ranking ",
       y = "Endowment Per Student (USD)")+
  geom_smooth(method='lm', se = FALSE, col = "blue", linetype = "dotted") +
  scale_y_continuous(labels = scales::comma)+
  annotate("text", x = 200, y =  400000, label = "Regression Line", color = "black", size = 3) + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Plot 3: Endowment per student vs. average median household income
lab_pos <- data.frame(University =c("Harvard University", "Princeton University", "Yale University", 
                "Columbia University", "University of Pennsylvania", "Brown University", 
                "Dartmouth College", "Cornell University"), 
                x = c(2300000, 3800000, 3400000, 620000, 800000, 800000, 1300000, 500000), 
                y = c(92000, 83000, 72000, 84000, 47000, 53000, 65000, 59000))
ggplot(insights_table, aes(x = EndowmentPerStudent, y = AverageCountyIncome, size = EndowmentPerStudent)) +
  geom_point(aes(color = University), alpha = 0.7, show.legend = FALSE) +
  scale_size_continuous(range = c(3, 12)) +  #dot size range
  labs(title = "Figure 3: Endowment Per Student vs. Average Median Household Income",
       x = "Endowment Per Student (USD)",
       y = "Average Median Household Income (USD - Adjusted for Inflation)",
       caption = "Bubble size = Endowment Per Student") +
  scale_x_continuous(labels = scales::comma)+
  scale_y_continuous(labels = scales::comma)+ 
  geom_text(data = lab_pos, aes(x = x, y = y, label = University), size = 3)+
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())

# Plot 4: Average revenue per student vs. average median household income
# put data into long format so that it's easier to make levels for the data
insights_long <- insights_table %>%
  mutate(University = factor(University)) %>%
  pivot_longer(cols = c(AverageRevenuePerStudent, AverageCountyIncome),
               names_to = "Variable", values_to = "Values")

# grouped bar plot 
ggplot(data = insights_long, aes(x = University, y = Values, fill = Variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(title = "Figure 4: Average Revenue per Student vs. Average Median Household Income",
       x = NULL, y = "USD") +
  scale_y_continuous(labels = scales::comma, limits = c(0, 550000)) +
  scale_x_discrete(labels = NULL) +
  theme_minimal() +
  facet_grid(. ~ Variable) +
  geom_text(aes(label = University),
            position = position_dodge(width = 0.9),
            size = 3,
            vjust = 0.5,
            angle = 90,
            hjust = -0.3)+
  guides(fill = FALSE)

# Plot 5: Average revenue per student vs. average university ranking
lab_pos1 <- data.frame(University = c("Harvard University", "Princeton University", "Yale University", 
                "Columbia University", "University of Pennsylvania", "Brown University", 
                "Dartmouth College", "Cornell University"), 
                x = c(315000, 230000, 365000, 135000, 280000, 140000, 190000, 190000), 
                y = c(0, 7, 21, 25, 30, 50, 210, 26))

ggplot(insights_table, aes(x = AverageRevenuePerStudent, y = AverageOverallRanking, size = AverageOverallRanking, fill = University)) +
  geom_point(aes(color = University), alpha = 0.7, show.legend = FALSE) +
  labs(title = "Figure 5: Average Revenue Per Student vs. Average University Ranking",
       x = "Average Revenue Per Student (USD)",
       y = "Average University Ranking", 
       caption = "Bubble size = Ranking (large bubble = low ranking)") +
  scale_x_continuous(labels = scales::comma) +
  geom_text(data = lab_pos1, aes(x = x, y = y, label = University), size = 3)+
  scale_size_continuous(range = c(10, 3)) +  # Adjust the range for bubble sizes
  theme(legend.position = "none")+
  theme_minimal()+ theme(legend.position = "bottom", legend.title = element_blank())

Mapping

The graphs we just plotted were informative and fun, but what about an interactive plot? Here we’ll make a map that shows:

For this doing, we’ll first need another SQL query that will give us:

We’ll use the tmap package and the tigris package to retrieve a shapefile of the US.

# Connect to database and use simple SQL query to get the information 
db<-dbConnect(RSQLite::SQLite(), "demo_db.db")
university_data <- dbGetQuery(db, "SELECT
           Name, 
           Coordinates, 
           Status, 
           IsIvy
        FROM R1R2_uni_list
        ")
options(tigris_use_cache = TRUE) #follow the documentation on how to retrieve data from package

us_map <- tigris::states(class = "sf") # retrieve shapefile from data

# separate the latitude and longitude 
processed_data <- university_data %>%
  mutate(
    Coordinates = sapply(strsplit(Coordinates, "\\s+"), function(x) paste(x[1], x[2], sep = ",")), 
    Latitude = sapply(strsplit(Coordinates, ","), `[`, 1),
    Longitude = sapply(strsplit(Coordinates, ","), `[`, 2)
  ) %>%
  select(-Coordinates)

# function which transforms DMS coordinates into decimals 
dms_to_decimal <- function(coord) {
  # Extract degrees, minutes, seconds, and direction using regex
  parts <- strsplit(coord, "[^0-9.]+")[[1]]

  # Convert parts to numeric values
  degrees <- as.numeric(parts[1])
  minutes <- ifelse(length(parts) >= 3, as.numeric(parts[2]), 0)
  seconds <- ifelse(length(parts) >= 4, as.numeric(parts[3]), 0)
  
  # calculate decimal degrees
  decimal <- degrees + minutes / 60 + seconds / 3600

  return(decimal)
}
# load the library tmap
library(tmap)
# apply to data
processed_data$Latitude <- sapply(processed_data$Latitude, dms_to_decimal)

# multiply by -1 to have the negative longitude (West)
processed_data$Longitude <- sapply(processed_data$Longitude, dms_to_decimal) * (-1)

# Filter out rows with missing coordinates
coordinates_data <- processed_data[complete.cases(processed_data[, c("Longitude", "Latitude")]), ]

# Convert IsIvy to a factor with levels rather than a character object
coordinates_data$IsIvy <- factor(coordinates_data$IsIvy, levels = c("No", "Yes"))

library(sf)
# Convert coordinates_data to sf for compatibility 
sf_coordinates_data <- st_as_sf(coordinates_data, coords = c("Longitude", "Latitude"), crs = 4326)

# create a first layer for the public and private universities
status_layer <- tm_shape(sf_coordinates_data %>% filter(IsIvy == "No")) + 
  tm_dots(
    size = 0.1, 
    col = "Status", 
    palette = c("royalblue", "lightgreen"), 
    legend.show =  FALSE
  )
# create a second layer for the ivys
ivy_layer <- tm_shape(sf_coordinates_data %>% filter(IsIvy == "Yes")) + 
  tm_dots(size = 0.1, 
          col = "IsIvy", 
          palette = c("red"),
          legend.show = FALSE)

# get map
tm_us <- tm_shape(us_map) + 
  tm_borders(alpha = 0.03)

# create the legend
legend <- tm_add_legend(type = "fill", col = c("lightgreen", "royalblue", "red"), 
                        labels = c("Public", "Private (non-profit)", "Ivy"), 
                        title = "University Type")

# create interactive map 
map <- tm_us + status_layer + ivy_layer + 
  tm_layout(title = "Universities in the United States", legend.position = c("left", "bottom")) +
  legend

# Display the map
tmap_mode("view")
map